M.Yasoob Khalid - Web Scraping in Python 101 [EuroPython 2014] [22 July 2014]
This talk is about web scraping in Python: why web scraping is useful and which Python libraries are available to help you. I will also look at proprietary alternatives, discuss how they work, and explain why they are not useful. I will show you the different libraries used in web scraping, with some example code, so that you can choose your own personal favourite. Finally, I will explain why writing your own scraper in Scrapy gives you more control over the scraping process.
Who am I?
=========

* a programmer
* a high school student
* a blogger
* Pythonista
* and tea lover

- Creator of freepythontips.wordpress.com
- I made soundcloud-dl.appspot.com
- I am a main contributor to youtube-dl.
- I teach programming to my friends at school.
- It's my first programming-related conference.
- The life of a Python programmer in Pakistan
What is this talk about?
========================

- What is web scraping and why is it useful
- Which libraries are available for the job
- Open source vs proprietary alternatives
- Which library is best for which job
- When and when not to use Scrapy
What is Web Scraping?
=====================

Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.

- Wikipedia
### In simple words

It is a method to extract data from a website that does not have an API, or from which we want to extract a LOT of data that we cannot get through an API because of rate limiting.
Any data that we can see while browsing the web can be extracted through web scraping.
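The idea can be illustrated with a minimal sketch that uses only Python's standard library. The HTML snippet, the class names (`product`, `name`, `price`) and the `ProductParser` helper are assumptions made up for this illustration; a real scraper would first download the page over HTTP and would more likely use one of the dedicated libraries.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li class="product"><span class="name">Green tea</span> <span class="price">4.50</span></li>
    <li class="product"><span class="name">Black tea</span> <span class="price">3.25</span></li>
  </ul>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collect the text of every <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.products = []   # finished {"name": ..., "price": ...} records
        self._field = None   # which field the parser is currently inside
        self._current = {}   # record being built for the current <li>

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field is not None:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span" and self._field is not None:
            self._field = None
        elif tag == "li" and self._current:
            self.products.append(self._current)
            self._current = {}


def scrape_products(html):
    parser = ProductParser()
    parser.feed(html)
    return parser.products


products = scrape_products(SAMPLE_HTML)
for product in products:
    print(product["name"], product["price"])
```

The same pattern (find the elements that hold the data, then pull out their text) is what the higher-level libraries make shorter and more robust.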
Usage of web scraping in real life
==================================

- Extract product information
- Extract job postings and internships
- Extract offers and discounts from deal-of-the-day websites
- Crawl forums and social websites
- Extract data to make a search engine
- Gather weather data, etc.
Advantages of web scraping over using an API
============================================

- Web scraping is not bound by an API's rate limits
- You can access the website and gather data anonymously
- Some websites do not have an API
- Some data is not accessible through an API, etc.
Which libraries are available for the job?
==========================================

There are numerous libraries available for web scraping in Python. Each library has its own strengths and weaknesses.
Some of the most widely known libraries used for web scraping are:
- BeautifulSoup
- html5lib
- lxml
- re (not really for web scraping, I will explain later)
- scrapy (a complete framework)
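One commonly cited reason why `re` is a poor fit for scraping, which may or may not be the reason given in the talk, is that a regular expression matches the exact text of the markup, while a parser understands its structure. The sketch below compares the two using only the standard library; the two HTML snippets, the `job` class and the helper names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Two hypothetical snippets that a browser would render identically.
HTML_A = '<a href="/jobs/1" class="job">Python developer</a>'
HTML_B = "<a class='job'\n   href='/jobs/1'>Python developer</a>"

# A regular expression written against the first snippet only.
LINK_RE = re.compile(r'<a href="([^"]+)" class="job">')


def links_with_regex(html):
    return LINK_RE.findall(html)


class LinkParser(HTMLParser):
    """Collect the href of every <a class="job"> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "job":
            self.links.append(attrs.get("href"))


def links_with_parser(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


print(links_with_regex(HTML_A))   # ['/jobs/1']
print(links_with_regex(HTML_B))   # [] because attribute order and quotes changed
print(links_with_parser(HTML_A))  # ['/jobs/1']
print(links_with_parser(HTML_B))  # ['/jobs/1']
```

The regular expression could be extended to cover the second snippet, but every new variation in the markup needs another patch, which is the kind of work a parsing library already does.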
A comparison between these libraries
====================================

- speed
- ease of use
- what I prefer
- which library is best for which purpose
Proprietary alternatives
========================

- a list of proprietary scrapers
- their price
- are they really useful for you?
Scrapy
======

- what is it
- why is it useful
- asynchronous support
- an example scraper
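To give a rough feel for why asynchronous support matters in a scraper, here is a small sketch of the general idea. It is not Scrapy code: Scrapy is built on Twisted and has its own API, whereas this uses the standard library's `asyncio`. The URLs and the `fake_fetch` helper are hypothetical, and no network request is made; each "download" is simulated with a short sleep.

```python
import asyncio
import time

# Hypothetical URLs; nothing is actually downloaded in this sketch.
URLS = ["http://example.com/page/%d" % n for n in range(1, 6)]


async def fake_fetch(url, delay=0.2):
    """Stand-in for a network request: wait, then return a fake body."""
    await asyncio.sleep(delay)
    return "<html>%s</html>" % url


async def crawl(urls):
    # Start every request at once and wait for all of them together,
    # instead of waiting for each response before sending the next request.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


start = time.perf_counter()
pages = asyncio.run(crawl(URLS))
elapsed = time.perf_counter() - start
print(len(pages), "pages in %.2f seconds" % elapsed)
```

Fetched one after another, the five simulated requests would take about one second in total; run concurrently, they finish in roughly the time of a single request. A framework with asynchronous support gives a crawler this behaviour without the author having to manage it by hand.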